Abstract

Code metrics are easy to define, but not so easy to justify. It is hard 
to prove that a metric is valid, i.e., that measured numerical values 
imply anything on the vaguely defined, yet crucial software properties 
such as complexity and maintainability. This paper employs statistical 
analysis and tests to check some "believable" presumptions on the 
behavior of software and metrics measured for this software. Among 
those are the reliability presumption implicit in the application of any 
code metric, and the presumption that the magnitude of change in a 
software artifact is correlated with changes to its version number. 

Putting a suite of 36 metrics to the trial, we confirm most of the 
presumptions. Unexpectedly, we show that a substantial portion of 
the reliability of some metrics can be observed even in random changes 
to architecture. Another surprising result is that Boolean-valued metrics
tend to flip their values more often in minor software version
increments than in major increments.

Keywords: metrics; reliability; software architecture



1 Introduction 

1.1 Metrics' Reliability 

Software metrics are considered [?] an important tool of software engineering. 
However, just as with any other kind of measurement, software metrics are
subject to reliability and validity concerns. The main doubt in using
metrics is that of validity: "What do these numbers mean?", "How do they
reflect on quality?", "On complexity?" are typical questions that one would ask
when bombarded with a list of metric values.

In contrast, the issue of reliability, even if occasionally mentioned [?, ?, ?, 
?], was largely dismissed [?, ?, ?, ?, ?, ?]. Reliability concern is believed to 
be associated more with psychological and biological measurements, such as 
performance in an IQ test or human evaluation of an X-ray scan — in which 
measurement errors are inherent. However, people compute software metrics 
without worrying about reliability — after all, how can the computation of 
the number of lines of code yield an inaccurate result? 

One of the presumptions this work examines is the hidden presumption of 
absolute reliability in measurement and ensuing evaluation of software. We 
maintain that such measurement, per se, is accurate, and of course do not 
argue e.g., that quantum phenomena may inject errors into the underlying 
computations; nor do we care here about defects in the implementation of 
metric algorithms. However, we suggest that in the face of software changes,
the method of "instantaneous capture" applied in the computation of metrics
should be scrutinized. The motivation should be clear: long gone are the days
when software was produced, frozen, and then used without any subsequent
changes. With the increasing shift to agile development processes [?], changes
to software are becoming even more frequent, and hence metric values have
shorter life spans.

Say that a certain class's Depth in Inheritance Tree is 3 in version 20.0-bll
of a certain software artifact; then, some natural questions to ask are
whether this value is good or bad, how it reflects on maintainability, whether 
it can be used to predict correctness, etc. All these questions belong in the 
validity domain. One can also ask whether there is a trend of increasing 
or decreasing of the Depth in Inheritance Tree metric, should this trend 
be encouraged or discouraged, whether the measured value is reflective of 
the entire artifact, typical of artifacts of this kind, etc. However, there is a 
more fundamental reliability question to ask: is this measurement stable with 
respect to natural evolution of the software? In other words, one may ask 
what is the likelihood of finding a different value in version (say) 20.0-blla of 
the said artifact. There is little point in making any conclusions regarding any 
measurement if this measurement is subject to random fluctuations during 
software evolution. 

This observation led us to investigate additional presumptions
that relate to changes in software size, and to the correlation between changes in
software versions and the magnitude and nature of the changes in the code of the
software artifacts.

Our study employed 36 different code metrics, selected from several in- 
dependent collections of metrics. We organized these metrics in a taxonomy 
whose main groups are: marker (or Boolean) metrics, local numerical met- 
rics and global numerical topological metrics. To understand better what we 
mean by "topological" metrics, recall that programs can be readily repre- 
sented as a directed graph of classes, packages, or other modules, which can 
then be subjected to graph theoretical algorithms [?, ?, ?]. 

To check the different presumptions, we used a large corpus of software 
versions, and applied the same set of metrics to each version. We then asked 
whether the results are reliable, i.e., whether the values obtained in a certain 
version are predictive of the values in the subsequent version. We further 
investigated how changes in version size and number are correlated with the 
metrics. Finally, we examined how some presumptions change for different 
groups of metrics. 

An intriguing finding of this work is that a substantial portion of the 
reliability of the global metrics can be observed even if random perturbations 
are applied to the architecture. This means, in a sense, that these metrics 
do not capture an inherent architectural property of the software. 

Another interesting result is that marker metrics tend to change less in 
major version increments and more in minor version changes. This may mean 
that major version releases are more stable and carefully organized than the 
minor ones. 

1.2 Platitudes 

The "reliability" presumption, which we were able to partially confirm, is just
one of many hidden presumptions, or platitudes, as we shall interchangeably
refer to these henceforth, regarding software evolution. In the course of our
study, we were able to examine several of these, refuting some and confirming
the others.

The list of these platitudes is as follows: first, everyone knows that 

software comes in many different sizes (size-variety) 

There is an almost universal agreement on a Dewey-like version numbering
scheme, and people tend to believe that

changes to major version number are correlated with the magnitude of
the software change (major-changes-large)

It is likewise common knowledge that



software changes fall in a continuous range between the evolutionary end, 
at which most meaningful properties are preserved, and the antipodal rev- 
olutionary end (evolutionary-revolutionary-spectrum) 

The concepts of revolutionary and evolutionary changes may seem amor- 
phous. However, one may think of restructuring an existing software system 
to fit a model-view-controller pattern [?] as a revolutionary change, and 
of adding new encryption method to a banking system as an evolutionary 
change. 

Further, it is plausible to assume that 

most releases of new software versions are evolutionary (mostly-evolutionary)

and that 

revolutionary changes tend to coincide with changes to major version
number (revolutionary-changes-in-major-versions)

We may also subscribe to beliefs regarding the kind of changes. One would 
tend to think that 

additions to an existing software body tend to follow existing style; more 
so with evolutionary changes (preservation-of-style) 

Also, it is believable that 

even large changes to software tend to leave substantial isolated portions 
of the code unchanged (locality-of-change) 

And, of course, the tacit reliability presumption that we began with is:

metrics are reliable (metrics-reliability) 



Our results confirmed most of the presumptions. (For example, we found
that the reliability of final or abstract is typically close to 100%.) However, a
number of very "believable" presumptions, including (locality-of-change),
were refuted.

1.3 Contributions



The main contributions of this paper are: 

1. Raising the somewhat less visited issue of software metrics reliability. 

2. An introduction of a taxonomy of code metrics. 

3. The discovery of similarity in many of the properties of metrics in each 
group. 

4. A systematic application of statistical methods to confirm (or refute) 
presumptions on software. 

5. The revelation that local metrics are highly reliable. 

6. The discovery that although global metrics tend to be reliable, much 
of this reliability is due to the limited scope of changes. 

7. The discovery of the surprising fact that marker metrics tend to change
more often in minor version changes.

8. The revelation that local metrics are 99% reliable. 

9. The discovery of the link between the ranking imposed by numerical 
global metrics and the topological architecture, i.e., software properties 
which can be inferred by examining the structure of the software graph, 
but without using any semantical information. 

10. The discovery that although global types tend to be reliable, much of 
this reliability is due to the limited scope of changes. 

Outline. 

The remainder of this article is organized as follows. The data corpus and
the way it was selected are described in Section 2. This section also discusses
the presumption of (size-variety). Section 3 then analyzes the size changes of
software artifacts present in the corpus, examining presumptions
(locality-of-change), (evolutionary-revolutionary-spectrum), (mostly-evolutionary)
and (major-changes-large). This section takes an intermission to remind
the reader of Kendall's tau correlation coefficient and its use as an indicator
of similarity between rankings. We will use this coefficient later also in the
analysis of numerical metrics.

Section 4 presents our metrics suite and the way it was selected, and our
metrics taxonomy. In Section 5 we study the reliability of the metrics that
yield Boolean values, discussing the (mostly-evolutionary),
(evolutionary-revolutionary-spectrum) and (preservation-of-style) presumptions. These
three presumptions are revisited in Section 6, which presents reliability results
of numerical metrics. The analysis that shows that at least part of the
reliability of global numerical metrics cannot be attributed to inherent "software
architecture" is presented in Section 7.

Related work is the subject of Section 8, while the final section concludes and
suggests directions for further research.

2 Software Corpus 
2.1 Artifacts 

The software corpus used in our experiments comprised 19 software artifacts,
all drawn from the Qualitas Corpus,¹ a colossal collection of Java software
that is being used extensively in many empirical software engineering studies.²

These artifacts included: the Java compiler, javac, ant (Java's equivalent
of make), and junit (the Java unit testing library), Eclipse's JDT core,
search, and SWT, FreeCol (a simulation game), Antlr (a framework for
constructing compilers, interpreters, etc.), hibernate (a persistence framework),
holds (a relational database engine), jgraph (a graph drawing package),
log4j (the logging component of Apache), struts (the Apache framework
for the creation of web applications), weka (data mining and machine learning
software), argouml (a UML diagramming application), hsqldb (hyper SQL
database engine), jhotdraw (Java GUI framework for technical and
structured graphics), jung (framework for modeling, analysis and visualization of
graphs), and proguard (Java shrinker, optimizer, obfuscator and preverifier).

¹ See the Qualitas Research Group, Qualitas Corpus, http://www.cs.auckland.ac.nz/~ewan/corpus

² See http://www.cs.auckland.ac.nz/~ewan/corpus/publications.html for a partial list.

2.2 Versions



For each artifact, we analyzed a number of versions from the corpus. We 
harvested all available versions of each of the artifacts, omitting only three 
versions in which global renaming made it difficult to automatically trace 
classes of previous versions. In total, our corpus comprised 95 versions. 

The essential size characteristics of the corpus are summarized in Table 2.1.
The corpus totaled some 78 thousand types, organized in 5,500
packages. In agreement with (size-variety), the number of types in the
versions selected in the corpus spans two orders of magnitude (42 through 6,444),
with a median and average of a few hundred types. A similar variety is
observed in the number of packages.

Size Metric   Mean          Median        Min   Max      Total
Types         822±1,125     420±285       42    6,444    78,099
Packages      58±98         23±13         3     469      5,500
Edges         3,767±4,910   2,069±1,437   77    27,764   357,897

Table 2.1: Size characteristics of the software corpus. 

Each software version was modeled as a directed graph, in which types
serve as nodes, and edges lead from a type to all types which it uses directly,
i.e., by inheriting from it, declaring a variable of it, invoking one of its methods,
etc. Edges leading outside the artifact, e.g., the edge that leads from
almost every Java class to java.lang.Object, were ignored. The number
of edges thus found is shown in the last row of the table. Not surprisingly
(size-variety), we see a variety of two orders of magnitude in the number of
edges as well.

Table 2.1 introduces a ± notation that embellishes the mean with the
standard deviation, e.g., the mean number of types is 822 (averaged over
all 95 software versions), while the standard deviation is 1,125. Similarly, the
median is embellished with the median absolute deviation (M.A.D.), defined
as the median of the absolute deviations from the median of the distribution.

The large standard deviation and the wide range of values are not 
surprising — software varies greatly in size. For this reason, we prefer the 
median and the M.A.D. as a pair of summarizing statistics over the mean 
and standard deviation. Admittedly, the median and the M.A.D are less 
efficient statistical measures than the mean and the standard deviation, but 
they are robust to outliers, which are unavoidable with this great variety. 
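
The difference between the two pairs of summary statistics can be seen on a small example. The following is a minimal sketch using Python's standard `statistics` module; the helper name `median_abs_deviation` and the sample values are illustrative assumptions, not data from the corpus, and the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, median, pstdev


def median_abs_deviation(values):
    """Median of the absolute deviations from the median (M.A.D.)."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)


# Hypothetical type counts of five versions; the last one is an outlier.
sizes = [42, 100, 420, 500, 6444]

# Mean and (population) standard deviation are pulled up by the outlier.
print(mean(sizes), pstdev(sizes))

# Median and M.A.D. are barely affected by it.
print(median(sizes), median_abs_deviation(sizes))
```

In this made-up sample the single large version dominates the mean, while the median stays at a typical size.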

2.3 Pairs



Our study of software change was carried out by organizing the 95 versions in the
corpus in an ensemble of 76 pairs of subsequent versions of the same artifact.

Some statistics of software growth and the extent of preservation in the
pairs of the corpus are shown in Table 2.2.



Metric                       Mean (%)   Median (%)   Min (%)   Max (%)
Types                        137±64     112±11       100       517
Edges                        146±92     118±17       90        712
Remaining Types              91±16      97±3         19        100
Continuing Types             75±23      79±17        11        100
Remaining Edges              86±18      94±6         23        100
Continuing Edges             71±26      76±20        6         100
Unchanged Types (outgoing)   17±7       17±5         1         36
Unchanged Types (incoming)   16±7       17±4         1         36
Unchanged Types (both)       16±7       17±5         1         36



Table 2.2: Growth and changes in consecutive artifact versions. 



Table 2.2 should help us appreciate the magnitude of changes and the
extent of preservation in the version pairs used in our corpus. On average, 
the number of types increased by 37% and the number of edges by 46%. 
Again, we observe a wide spectrum of changes, e.g., in one of the pairs, the 
number of edges increased by a factor of 7. There were even cases in which 
the number of edges decreased, probably thanks to code refactoring which 
reduced coupling between types. 

This great variety can be interpreted as a supportive indication of
(evolutionary-revolutionary-spectrum). Furthermore, the fact that in the
first two lines of Table 2.2 the median is smaller than the mean is consistent
with the presumptions that most changes are evolutionary, and that
evolutionary changes typically incur smaller size changes. However, this raw data
does not provide sufficient grounds for the correct placement of any given
pair between the evolutionary and revolutionary extremes.

The next group of rows in Table 2.2 shows the statistics of the ratio of
types (resp. edges) that are common to both versions of a pair, compared
to the total number of types (resp. edges) in the earlier version (remaining) and the later
version (continuing). We have that the mean fraction of remaining types
is 91%, while the median fraction is 97%. The fact, recurring across the
entire group, that the median is greater than the mean is, again, consistent
with the presumption that most of the pairs represent a more evolutionary
change, in which most of the types remain in the subsequent version. The
fewer pairs which represent revolutionary changes bring the mean lower than
the median.

Concentrating on the median, we see that typically about 3% of types
and 6% of edges are lost with the release of a new version, and about 80% of
the types and edges in the new version existed in the previous one.
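
The "remaining" and "continuing" ratios can be illustrated with plain set operations. This is a minimal sketch, assuming each version is represented as a set of fully qualified type names; the function name `preservation` and the toy versions are hypothetical.

```python
def preservation(earlier, later):
    """Return (remaining, continuing) fractions for two sets of names.

    remaining:  share of the earlier version that survives in the later one
    continuing: share of the later version that already existed earlier
    """
    common = earlier & later
    return len(common) / len(earlier), len(common) / len(later)


# Hypothetical pair of versions: type D is dropped, E and F are introduced.
v1 = {"A", "B", "C", "D"}
v2 = {"A", "B", "C", "E", "F"}

remaining, continuing = preservation(v1, v2)
print(f"remaining={remaining:.0%} continuing={continuing:.0%}")
```

The same function applies unchanged to sets of edges, represented for instance as (source, target) tuples.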

Note that it could be the case that some of the types which were marked 
as removed by our analysis tools, were simply renamed. The extent of this 
analysis error is bounded above by the (small) number of removed types. 

Finally, the last rows of the table summarize the percentage of the types
for which none of the incoming (outgoing) edges were changed during the evolution
process. We learn that for about 17% of the types, the relations to or from
other types stay unchanged.

We can say that the functionality of one in six types does not change, at
least in the sense that the set of other types it uses does not change. Also,
one in six types does not change its duty in two subsequent versions of an
artifact, at least as far as this duty is judged based on the set of other types
it serves. Conversely, (locality-of-change) is not confirmed by these results:
changes to software typically border on 5 out of 6 types.

To summarize, the typical topological change between two subsequent 
versions of a software artifact is characterized by: 

1. a preservation of almost all types (3% are lost); 

2. a preservation of almost all edges (6% are lost); 

3. a preservation of the locale of about one sixth of the types; 

4. an increase of about 10% in the number of types; and, 

5. an increase of about 20% in the number of edges. 

3 Size Changes of Artifacts in the Corpus 

3.1 Correlating Magnitude Changes with Version 
Number Changes 

Our study of the correlation between magnitude changes and version number
changes begins with the introduction of a notion of version number change
cardinality, which is assigned to each pair of artifact versions by comparing
the version numbers (as assigned by the artifact numbering scheme) of the
pair members: a cardinality of $1/2^{n-1}$ is associated with a change to an $n$-th
level version number. Thus, the cardinality of a change to the major version
number is 1; a change to the second level version number (e.g., versions 1.3
and 1.4) has cardinality 1/2, etc.
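
The cardinality rule can be sketched in a few lines. The sketch below assumes dot-separated version strings and pads the shorter string with zeros so that, e.g., 1.3 and 1.3.1 can be compared; both the function name `change_cardinality` and the padding convention are illustrative assumptions, since the text does not say how such cases were handled.

```python
from fractions import Fraction


def change_cardinality(old, new):
    """Cardinality 1/2**(n-1), where n is the first (1-based) level at
    which the two dot-separated version numbers differ."""
    a, b = old.split("."), new.split(".")
    # Assumption: missing trailing levels are treated as "0".
    width = max(len(a), len(b))
    a += ["0"] * (width - len(a))
    b += ["0"] * (width - len(b))
    for level, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return Fraction(1, 2 ** (level - 1))
    raise ValueError("version numbers are identical")


print(change_cardinality("1.9", "2.0"))    # change at the major level
print(change_cardinality("1.3", "1.4"))    # change at the second level
print(change_cardinality("1.3", "1.3.1"))  # change at the third level
```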

Our ensemble comprised 12 pairs of change cardinality 1, 36 pairs of 
cardinality 1/2, 19 pairs of cardinality 1/4, 8 pairs of cardinality 1/8, and 1 
pair of cardinality 1/16. 

Figure 3.1 now depicts the distribution of relative changes in the number
of edges and types in the corpus pairs. Each circle in the figure corresponds
to a pair in the corpus, with larger circles corresponding to more cardinal changes.





[Plot omitted. Surviving labels: panel "Types"; horizontal axis from 100% to 600%; legend "Larger circles denote more cardinal version changes".]

Figure 3.1: Distribution of relative changes in the number of edges and the 
number of types. 

We do not know whether the more modest changes are evolutionary, that
is, whether these changes tend to preserve existing properties. However, the
picture depicted in Figure 3.1 is at least consistent with
(evolutionary-revolutionary-spectrum) and (mostly-evolutionary): in most pairs, the
increase in the number of types (edges) is modest; notwithstanding, a
non-vanishing number of the pairs exhibit substantial increases in the number of
types (edges).

Are the more drastic changes linked to the publication of major new
editions of the software? It is difficult to confirm or refute
(major-changes-large) by visual inspection of Figure 3.1, in trying to determine whether the
larger circles tend to show on the left or on the right of the figure. Instead,
we shall describe an analytical method for studying this correlation using
Kendall's tau coefficient [?].

3.2 Kendall's Tau Coefficient and Statistical Test 

Kendall's tau correlation coefficient gives a measure of the agreement between
two different rankers of the same data set. Given is a set of elements, and
their relative ranking by two rankers. Then, the coefficient is defined based
on two values, $n_c$ and $n_d$, where $n_c$ is the number of concordant pairs, i.e.,
pairs of elements whose relative ranking by the two rankers are ordered in
the same way, e.g., both rankers agree that the first element is "better" than
the other, and where $n_d$ is defined similarly as the number of discordant
pairs. The simplest definition of the coefficient is $n_c - n_d$ divided by the total
number of pairs (that is, $\binom{n}{2}$, where $n$ is the number of elements in the ranked
set).
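
The simplest form of the coefficient can be written down directly from the definition of concordant and discordant pairs. The following is an illustrative sketch rather than the tooling used in the study; the function name `kendall_tau_simple` is hypothetical, and tied pairs are counted as neither concordant nor discordant.

```python
from itertools import combinations


def kendall_tau_simple(rank_x, rank_y):
    """(n_c - n_d) divided by the total number of pairs, n*(n-1)/2."""
    n = len(rank_x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if sign > 0:
            n_c += 1      # both rankers order the pair the same way
        elif sign < 0:
            n_d += 1      # the rankers disagree on the pair
    return (n_c - n_d) / (n * (n - 1) / 2)


# Two rankers that disagree only on the order of the last two elements.
print(kendall_tau_simple([1, 2, 3, 4], [1, 2, 4, 3]))
```

Identical rankings give 1 and fully reversed rankings give -1, matching the extreme values of the coefficient.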

The coefficient can be used for measuring the agreement of the ranking of 
a certain metric in two versions of the software. It assumes its maximal value 
of 1 in the case of full agreement, and its minimal value of -1 is achieved in 
the case of total disagreement. The set for comparison is that of the types 
which occur in both versions. 

Thus, Kendall's coefficient is similar to Pearson correlation, except that it
is non-parametric, which makes it suitable for comparing our "change cardinality" ranking
(which is ordered, but has no obvious, non-arbitrary mapping to numerical
values) with the magnitude of change.

Herein, we used a version of the coefficient, denoted $\tau_b$, which deals with ties
(e.g., two version pairs with the same cardinality of change). It is computed
as

$$\tau_b = \frac{n_c - n_d}{\sqrt{\left(\binom{n}{2} - \sum_{i=1}^{k} \binom{s'_i}{2}\right)\left(\binom{n}{2} - \sum_{j=1}^{\ell} \binom{s''_j}{2}\right)}}$$

where $n_c$ and $n_d$ are as before, and where $s'_1, \ldots, s'_k$ are the sizes of the
equivalence classes in the input set under one ranker (application of the metric to
one version of the artifact), while $s''_1, \ldots, s''_\ell$ are the sizes of the equivalence classes
in the input set under the other ranker (application of the metric to another
version of the artifact).
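
The tie-corrected coefficient can be sketched as follows, using the standard tau-b normalization in which the pairs tied under each ranker are subtracted from the total number of pairs. The function name `kendall_tau_b` and the sample data are illustrative assumptions; the sketch does not guard against the degenerate case where one ranker ties all elements.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: (n_c - n_d) / sqrt((n0 - n1) * (n0 - n2))."""
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            n_c += 1
        elif sign < 0:
            n_d += 1
    n0 = n * (n - 1) // 2                                  # all pairs
    n1 = sum(s * (s - 1) // 2 for s in Counter(x).values())  # pairs tied in x
    n2 = sum(s * (s - 1) // 2 for s in Counter(y).values())  # pairs tied in y
    return (n_c - n_d) / sqrt((n0 - n1) * (n0 - n2))


# Hypothetical data: change cardinality (with one tie) vs. relative growth.
cardinality = [1.0, 0.5, 0.5, 0.25]
growth = [3.0, 1.2, 1.5, 1.1]
print(kendall_tau_b(cardinality, growth))
```

For real analyses a library routine is preferable; SciPy's `scipy.stats.kendalltau` computes the tau-b variant by default and also reports a p-value.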

Kendall's tau coefficient is a non-parametric test, i.e., it does not depend
on the actual values but only on their relative ranks. Thus, if some
experiment was to suggest that the logarithm of the metric value should be used
instead of the metric value itself [?], the test results would not change.

The underlying statistical test assigns a statistical confidence level to each
value of $\tau_b$.

3.3 Statistical Test of (major-changes-large) 

In comparing the ranking of the pairs by the (relative) magnitude of change
and the change cardinality we found that $\tau_b = 0.35$ (resp. $\tau_b = 0.30$) when
the size of the increase is measured in types (edges) with p-value < 0.001.

To understand why the visual inspection of Figure 3.1 does not readily
yield the correlation we anticipated, consider the following intuitive (but not
entirely exact) interpretation of Kendall's tau coefficient. For types, we have
that $\tau = 0.35$. Then, the probability of a pair (of software artifacts) being
concordant is $p = (1 + \tau)/2 = 67.5\% \approx 2/3$. In other words, if two circles
of different size are selected at random from the lower part of Figure 3.1,
then with probability 2/3 the larger circle will fall to the right of the smaller
circle, as opposed to 1/2 probability when the cardinality of version change
is uncorrelated with magnitude. Now, the difficulty experienced in the visual
inspection is probably explained by the difficulty of distinguishing between
probability 2/3 and 1/2 for such pairs, a difficulty aggravated by the fact
that 31% of the pairs are of circles of the same size.
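
The interpretation above follows from a short calculation. As a sketch, assume that ties are ignored, so that every pair is either concordant (with probability $p_c$) or discordant (with probability $p_d$); the symbols $p_c$ and $p_d$ are introduced here only for illustration.

```latex
\[
\tau = p_c - p_d, \qquad p_c + p_d = 1
\quad\Longrightarrow\quad
p_c = \frac{1 + \tau}{2} = \frac{1 + 0.35}{2} = 0.675 \approx \frac{2}{3}.
\]
```

With $\tau = 0$ the same formula gives $p_c = 1/2$, the uncorrelated baseline mentioned in the text.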

Our values of $\tau_b$ were computed across all pairs, ignoring the concern
that an increment to the second level version number in one artifact may be as
drastic as a major version number increment in another. In restricting the
comparison to versions of the same artifact, we found even higher values that
are statistically significant, e.g., $\tau_b > 0.60$. (Notwithstanding, artifacts with
a small number of versions did not yield statistically significant values.)

3.4 Characteristic of Change 

We now present a topological breakdown of changes to the software graph.
This breakdown will be used below (Section 7) to guide the generation of a
random mutation of a given software version, and for examining the reliability
of metrics against these mutations.

Fix a pair of two consecutive versions of an artifact. Then, three kinds of
types can be distinguished: core types are those that are present in both
versions, removed types are those that are present in the early version but not
in the later version, and, conversely, new types are those that are present
in the newer version only. Also, changes to edges can be further
characterized: removed edges are either core/core, core/removed, removed/core or
removed/removed. Added edges are either core/core, core/new, new/core, or
new/new. Preserved edges are always of the core/core kind.
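
The breakdown above can be sketched with set operations over two versions of the graph. The representation (sets of type names and sets of (source, target) tuples) and the function name `classify_changes` are assumptions made for illustration; the sketch also assumes that every edge endpoint belongs to the type set of its version.

```python
from collections import Counter


def classify_changes(old_types, new_types, old_edges, new_edges):
    """Split types into core/removed/new and count edges by status and kind."""
    core = old_types & new_types
    removed = old_types - new_types
    new = new_types - old_types

    def kind(t):
        if t in core:
            return "core"
        return "removed" if t in removed else "new"

    counts = Counter()
    for status, edges in (
        ("preserved", old_edges & new_edges),
        ("removed", old_edges - new_edges),
        ("added", new_edges - old_edges),
    ):
        for source, target in edges:
            counts[(status, f"{kind(source)}/{kind(target)}")] += 1
    return core, removed, new, counts


# Hypothetical pair of versions: type C is removed, type D is introduced.
old_types, new_types = {"A", "B", "C"}, {"A", "B", "D"}
old_edges = {("A", "B"), ("A", "C"), ("C", "B")}
new_edges = {("A", "B"), ("B", "A"), ("D", "A")}

core, removed, new, counts = classify_changes(
    old_types, new_types, old_edges, new_edges
)
for key, value in sorted(counts.items()):
    print(key, value)
```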

Table 3.1 summarizes the statistics of the changes according to this
breakdown. All values in the table are obtained by first normalizing the
absolute numbers and then computing the median. Normalization of the absolute
number of edges (or types) was with respect to the number of edges (or types)
in the early version.



Edges Kind               Edges (%)   From Types (%)   To Types (%)
Core/Core (preserved)    94±6        95±4             64±12
Core/Core (added)        4±2         9±6              8±4
Core/Core (removed)      3±2         8±5              6±3
Core/New                 3±3         7±6              7±5
New/Core                 9±8         13±12            13±8
New/New                  7±7         19±17            14±13
Core/Removed             0±0         0±0              0±0
Removed/Core             1±1         2±2              3±3
Removed/Removed          0±0         1±1              1±1



Table 3.1: Breakdown of added and removed edges in the corpus (median 
values, normalized) 

For example, the first row of the table shows that 94% of the edges
between core types of an early version were preserved when progressing to a newer
version. Those edges originated in (had as their sources) 95% of the types;
as their targets, the edges used 64% of the types in the early version of the software
artifact.

The mid section of Table 3.1 reveals an interesting (but not too
surprising) property of the "graph cut" separating the old and the new portions of
software: the largest bulk of added edges are those that connect newly
introduced types to core types. Edges in the opposite direction, leading from
core types to newly introduced types, are rare. The second largest bulk of
added edges are among the newly introduced types.

4 Metrics and Their Taxonomy



There are hundreds, if not thousands, of metrics described in the literature. We
could not test them all. However, we tried to cover a variety of metrics and
give ample consideration to the most popular ones. This section describes
the 36 software metrics used in our experiments, and proposes a taxonomy
of metrics of this sort.

4.1 Criteria for Classification of Software Metrics 

Given is $G$, the directed graph of a software system, where each node $v$
represents a module of this system, and an edge $e(s, t)$ leads from a source node $s$
to a target node $t$ if type $s$ uses type $t$. A metric then is a function $\mu_G$
(or just $\mu$ if $G$ is clear from the context) that assigns a value $\mu(v)$ to each
node $v \in G$.
Metric nature. 

If $\mu(v)$ depends solely on the topology of $G$, we say that $\mu$ is topological.
In contrast to topological metrics stand semantical metrics, whose value takes
into account a deeper analysis of the node contents (by, e.g., examining the
code in this node), and the sort of the edges incident on it (by, e.g.,
distinguishing between different kinds of dependencies among nodes). The suite
includes 17 semantical metrics.
Metric directionality. 

The dual of a (topological) metric $\mu_G$ is a metric $\mu'_G$, defined by $\mu'_G(v) =
\mu_{G'}(v)$, where $G'$ is the graph obtained from $G$ by inverting the direction of
all edges in it. Thus, metrics $\mu_1$ and $\mu_2$ are duals if $\mu_1$ computed in $G$ is the
same as $\mu_2$ computed in $G'$. A metric is undirected if it is the dual of itself;
it is otherwise directed. Our metrics suite includes 18 directed metrics and
18 undirected metrics.
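
Duality can be illustrated with the two simplest topological metrics. In the sketch below, in-degree and out-degree serve as an example of a dual pair, and their sum as an example of an undirected metric; the edge-set representation and all function names are assumptions made for illustration.

```python
def reverse(edges):
    """The graph G' obtained by inverting the direction of every edge."""
    return {(t, s) for s, t in edges}


def out_degree(edges, v):
    return sum(1 for s, _ in edges if s == v)


def in_degree(edges, v):
    return sum(1 for _, t in edges if t == v)


def degree(edges, v):
    """In-degree plus out-degree: unchanged when all edges are inverted."""
    return in_degree(edges, v) + out_degree(edges, v)


# Hypothetical graph: A uses B and C, and B uses C.
edges = {("A", "B"), ("A", "C"), ("B", "C")}

for v in "ABC":
    # in_degree computed in G equals out_degree computed in G' (duals),
    # while degree is its own dual (undirected).
    print(v, in_degree(edges, v), out_degree(reverse(edges), v), degree(edges, v))
```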
Metric scope. 

Another criterion for classification is whether $\mu(v)$ depends on $G$ in its
entirety, rather than on a restricted neighborhood of $v$. We say that a metric
is strictly local if $\mu(v)$ does not change with changes to $G$ that preserve
incoming and outgoing edges to $v$ (along with the identity of the nodes at
the other end of these). In other words, metric $\mu$ is strictly local if $\mu(v)$
depends solely on $v$ and its neighbors. Also, $\mu$ is local if for every $v \in G$
there is a set of nodes $S \subset G$ such that $\mu(v)$ does not change despite arbitrary
changes to $G$, as long as the nodes $S \cup \{v\}$ and the edges among these are
intact.

For example, the widely studied Chidamber and Kemerer (CK) suite [?]
has a number of strictly local metrics, including Number of Children (NOC)
and Coupling between Object Classes (CBO), which is defined as the number
of types whose methods may be invoked in response to a call to the methods of
a given type. The Depth in Inheritance Tree (DIT) metric, however, is local
but not strictly local.

Obviously, local metrics are more suited to the study of a single type, or 
a small portion of the code; this kind of metrics is not expected to be telling 
much of the architecture. 

Overall, we have 14 local metrics. A subcategory of local metrics (10
metrics in our suite) is that of internal metrics; a metric $\mu$ is internal if $\mu(v)$
depends only on $v$. An internal metric does not make sense unless it is semantical.
Weighted Methods Per Class (WMC) [?] is an example of an internal metric.

A metric which is not local is global, e.g., the PageRank metric mentioned 
above is global. 
Metric range. 

Our fourth criterion for classifying metrics is based on the type of
values they yield; continuous metrics (e.g., PageRank) yield real values, while
discrete metrics (e.g., CBO, NOC, and DIT) yield integers, typically drawn
from a small range, say o(|G|). We have 3 continuous metrics, and 19 discrete
metrics. The remaining metrics belong to a special kind of discrete metrics
henceforth called markers, which yield Boolean (that is, true or false) values.

4.2 Metrics Used in the Experiments 

Table 4.1 enumerates the metrics used in our experiments, classifying these
according to this taxonomy. 

Marker Metrics. 

The first fourteen metrics in the table are markers: final, abstract and 
interface are simply the Java class attributes with the same name. 

Next comes a group of four topological metrics. The sink marker is
assigned to types from which a bottom-up study of a software system may
start since they are referred to by any other type in the system (either
directly or indirectly). Conversely, the source marker is for types from
which a top-down study may start. The balloon marker (so named after balloon
types [?]) is for types which have only one client, i.e., nodes whose
in-degree is 1. The wrapper marker is just the opposite: nodes whose
out-degree is 1.

Following that, we have a group of micro-pattern markers [?]. For this
work, we carried out measurements on seven of these.

Chidamber and Kemerer Metrics.

The next six metrics are all semantical, undirected, discrete, and local; 
they were all drawn from Chidamber and Kemerer's suite [?], including the 
metrics described above, together with Response For a Class (RFC), which 
is the number of methods that can potentially be executed in response to an 
invocation of a method in the type. 

The WMC metric was computed by using the total number of instructions
in each method as the method's complexity. In addition to these basic
metrics, we included a variant of DIT, Number of Ancestors (NOA), which
seems appropriate for the inheritance structure of interfaces and classes in
JAVA. Of this suite [?], the Lack of Cohesion in Methods (LCOM) metric was
not included in our study.

Plain Topological Metrics. The next local metric is #Incoming, which
counts the number of immediate clients a type has. (Of course, this metric is
related to sink and wrapper metrics.) In contrast, #Clients is a global metric
defined as the total number of clients of a type, including both immediate
and non-immediate clients.

#Outgoing and #Descendants are the duals of these two, counting the
number of types that a given type uses directly and indirectly; observe that
#Descendants is identical to Page-Jones and Constantine's [?, Chap. 9]
encumbrance metric, which, according to the first author of this book, is
indicative of the "sophistication" of a type and its role, and may even be
predictive of its fate.
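
As an illustration only, the four plain topological metrics can be computed
from an edge list by counting neighbors and by graph reachability. The Python
sketch below is not the measurement tool used in the experiments. It assumes
that an edge (user, used) means that type `user` depends on type `used`, and
that a type is never counted as its own client or descendant; the function
names are hypothetical. The balloon and wrapper markers defined earlier are
included as the in-degree-1 and out-degree-1 cases.

```python
from collections import defaultdict

def reachable(adj, start):
    """Return all nodes reachable from `start` by following edges in `adj`."""
    seen, stack = set(), list(adj.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, ()))
    return seen

def plain_topological_metrics(edges):
    """edges: iterable of (user, used) pairs (assumed edge direction)."""
    out_adj, in_adj = defaultdict(set), defaultdict(set)
    nodes = set()
    for user, used in edges:
        out_adj[user].add(used)
        in_adj[used].add(user)
        nodes.update((user, used))
    metrics = {}
    for v in nodes:
        incoming, outgoing = len(in_adj[v]), len(out_adj[v])
        metrics[v] = {
            "incoming": incoming,                                # immediate clients
            "outgoing": outgoing,                                # immediate suppliers
            "clients": len(reachable(in_adj, v) - {v}),          # all clients
            "descendants": len(reachable(out_adj, v) - {v}),     # all suppliers
            "balloon": incoming == 1,
            "wrapper": outgoing == 1,
        }
    return metrics
```

Note that #Incoming and #Outgoing only inspect the neighbors of a type, while
#Clients and #Descendants require a traversal of the whole graph, which is
what makes them global.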

Strongly Connected Components Metrics. 

The next group of metrics is computed from the directed acyclic graph
of strongly connected components of G. Recall that there is a directed path
between any two nodes that reside in the same strongly connected component;
this theoretical structure of a graph makes sense in a software context
since all types in such a component are interdependent, and hence should
probably be studied together. A strongly connected component may thus be
thought of as a super module. In our suite, SCCSize represents the
size of the super module (i.e., the size of the strongly connected component)
that a type belongs to. #SCCIncoming and #SCCClients are, respectively,
the numbers of immediate and of indirect super-module clients that the super
module serves. Their duals are #SCCOutgoing and #SCCDescendants.
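
The construction can be sketched as follows. This is a minimal Python
illustration, not the implementation used in the experiments: it finds the
strongly connected components with Kosaraju's algorithm (any standard
algorithm would do), assumes that an edge (user, used) points from a client
to the type it uses, and counts in #SCCClients every super module that
reaches the given one, directly or indirectly. All function names are
hypothetical.

```python
from collections import defaultdict

def strongly_connected_components(nodes, adj):
    """Kosaraju's algorithm; returns a dict mapping node -> component id."""
    order, seen = [], set()
    for root in nodes:                        # pass 1: record DFS finish order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:                             # all successors handled
                order.append(node)
                stack.pop()
    radj = defaultdict(set)
    for u in nodes:
        for v in adj[u]:
            radj[v].add(u)
    comp = {}
    for root in reversed(order):              # pass 2: sweep the reversed graph
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in radj[node]:
                if prev not in comp:
                    comp[prev] = root
                    stack.append(prev)
    return comp

def scc_metrics(edges):
    """edges: iterable of (user, used) pairs (assumed edge direction)."""
    edges = list(edges)
    adj, nodes = defaultdict(set), set()
    for u, v in edges:
        adj[u].add(v)
        nodes.update((u, v))
    comp = strongly_connected_components(sorted(nodes), adj)
    size = defaultdict(int)
    for v in nodes:
        size[comp[v]] += 1
    users = defaultdict(set)                  # component -> immediate client components
    for u, v in edges:
        if comp[u] != comp[v]:
            users[comp[v]].add(comp[u])
    result = {}
    for v in nodes:
        c = comp[v]
        clients, stack = set(), list(users[c])
        while stack:                          # all components that reach c
            x = stack.pop()
            if x not in clients:
                clients.add(x)
                stack.extend(users[x])
        result[v] = {"SCCSize": size[c],
                     "SCCIncoming": len(users[c]),
                     "SCCClients": len(clients)}
    return result
```

The duals, #SCCOutgoing and #SCCDescendants, follow by applying the same
procedure to the graph with all edges reversed.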
Dominators Tree Metrics. 

The penultimate metrics group is computed from the dominators tree
of G. Recall that a node r dominates a node v if the only way of getting
into v is through r, and that there is an edge from r to v in this tree if r
is the "most immediate" dominator of v. Thus, the dominators tree is likely
to identify pivotal points of the software system. From this tree we compute
the #DominatedBy metric, which is the number of nodes that dominate a given
node; #DominatorHeight, which is the height of the node in the dominators
tree; and #DominatorWeight, which gives the number of nodes that a given
node dominates.
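
A simple way to obtain two of these metrics is the classical iterative
data-flow formulation of dominators. The Python sketch below is illustrative
only and rests on assumptions that the text does not state: it assumes a
single designated entry node `root` from which every node is reachable, and
it excludes a node from its own dominator count. The function names are
hypothetical, and faster algorithms exist for large graphs.

```python
def dominator_sets(nodes, edges, root):
    """Return dom[v], the set of nodes that dominate v (v itself included)."""
    preds = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
    dom = {v: set(nodes) for v in nodes}      # start from "everything dominates"
    dom[root] = {root}
    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for v in nodes:
            if v == root:
                continue
            incoming = [dom[p] for p in preds[v]]
            new = set.intersection(*incoming) if incoming else set()
            new = new | {v}
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom

def dominator_metrics(nodes, edges, root):
    """Return (#DominatedBy, #DominatorWeight), excluding self-domination."""
    dom = dominator_sets(nodes, edges, root)
    dominated_by = {v: len(dom[v]) - 1 for v in nodes}
    weight = {v: 0 for v in nodes}
    for v in nodes:
        for d in dom[v] - {v}:
            weight[d] += 1
    return dominated_by, weight
```

In a diamond-shaped graph, for instance, the node at the bottom of the
diamond is dominated by the node at its top but by neither of the two
intermediate nodes, since each of them can be bypassed through the other.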
Other Metrics. 

In the last group of metrics in Table 4.1 we have PageRank and
Betweenness, yet another measure of graph centrality [?]; roughly speaking,
nodes that occur on many shortest paths connecting other nodes have a higher
Betweenness value than those that do not.

The last metric in the table is Belonging, used, e.g., in JDepend (footnote 3)
and in SA4J (footnote 4), which estimates the extent to which a type belongs
to its package by dividing the number of edges it has (both incoming and
outgoing) to other types in the package by the total number of edges incident
on the type.
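
The definition translates directly into a ratio of two edge counts. The
following Python sketch is a reading of the definition above and not code
taken from JDepend or SA4J; the `package_of` mapping and the function name
are hypothetical, and self-edges are assumed to be absent.

```python
def belonging(edges, package_of):
    """edges: (user, used) pairs; package_of: dict mapping type -> package.

    Returns, for every type with at least one incident edge, the fraction of
    its incident edges whose other endpoint lies in the same package.
    """
    same, total = {}, {}
    for u, v in edges:
        for t in (u, v):
            total[t] = total.get(t, 0) + 1
            if package_of[u] == package_of[v]:
                same[t] = same.get(t, 0) + 1
    return {t: same.get(t, 0) / total[t] for t in total}
```

A value of 1 means that all the edges of a type stay within its package,
while a value of 0 means that the type interacts only with other packages.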



5 Marker Metrics 

Table 5.1 gives the essential statistics of the prevalence of the marker
metrics in the suite.

The first group of markers in the table are JAVA type attributes. As we
see, about 4% of all types are abstract, about 10% are interfaces, and 15%
are final. The variance of the final attribute is greater than that of abstract,
with some versions not using it at all, while others use it for almost half
of the classes. Later we will see that these simple architecture discovery
hints are surprisingly reliable.

The next group in Table 5.1 is of topological markers. The prevalence of
sinks is low, but could still be explanatory of the architecture of the
underlying software. About one in three types is a source in our corpus. This
is explained by the large number of frameworks and libraries in our data set.
The resilience of these two markers to the passage of time tends to be high,
as we shall see shortly.

Types with only one client (balloons) are quite popular (10%), but much
more popular are wrapper types, i.e., types that make use of only one type
in the artifact.

3 http://clarkware.com/software/JDepend.html
4 http://www.alphaworks.ibm.com/tech/sa4j

Micro patterns are the third group in the table. Of these, it is surprising
to see that almost one third of the types in the corpus are stateless. Even
though 10% of the types in the corpus are interfaces, about 20% of real classes
are stateless, i.e., do not manage state variables at all.

Overall, we see that there is a substantial variety in the prevalence of each
of the metrics. The prevalence of pure types, for example, ranges between 0.8%
and 21.2%. Even the smallest standard deviation, 0.4% for the function pointer
micro pattern, is large compared to its 0.2% average prevalence.

5.1 Reliability 

The top part of Figure 5.1 depicts the reliability values of marker metrics
in the corpus: columns correspond to the metrics, while each circle on a
column corresponds to a certain pair of consecutive versions. The circle
height represents the reliability of the marker metric in this pair, that is, the
portion of types that preserve this marker as the software evolves from the
earlier to the later version.
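
For concreteness, the reliability of a marker in one pair of versions can be
computed as in the Python sketch below. It is an illustration under stated
assumptions rather than the code used to produce the figure: each version is
represented by a hypothetical dictionary from type name to the Boolean marker
value, and types that disappear in the later version are ignored.

```python
def marker_reliability(earlier, later):
    """Fraction of types marked in `earlier` that are still marked in `later`.

    earlier, later: dicts mapping type name -> bool. Types missing from
    `later` are ignored; returns None when no marked type survives.
    """
    marked = [t for t, flag in earlier.items() if flag and t in later]
    if not marked:
        return None
    return sum(1 for t in marked if later[t]) / len(marked)

def negation_reliability(earlier, later):
    """Same computation for the negated marker (types that stay unmarked)."""
    negate = lambda version: {t: not flag for t, flag in version.items()}
    return marker_reliability(negate(earlier), negate(later))
```

A pair in which every final class stays final thus contributes a circle at
height 1 to the final column, while a pair in which half of them lose the
attribute contributes a circle at height 0.5.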

We see that reliability is generally high. Considering, for example, the
final marker, we observe that (i) in the vast majority of pairs, fewer than
5% of all final classes lose this property as the software evolves, but still,
(ii) there are pairs in which the loss of the final property occurs in 30%,
50% and even 100% of the cases.

More generally, we have that (i) in the majority of pairs, marker metrics 
are extremely reliable (the median reliability is always 95% or higher, and is 
greater than 99% for all but two metrics); however, (ii) in a non-negligible 
number of pairs, a large portion of the types lose their marking. 

The bottom part of Figure 5.1 depicts similar information as the top,
except that it pertains to the negation of marker metrics, e.g., the first column
of circles in the figure represents the relative number of classes that were not
final in the earlier version, but became final in the later version. Even
though negations are more reliable, phenomena (i) and (ii) can still be
discerned.

The numerical results support these observations: the median reliability
is always 99% or higher; it is around 100% for the vast majority of metrics.
This, together with the fact that the mean reliability is 92% or higher, and
is most often greater than 98%, confirms (metrics-reliable).

Also, together, (i) and (ii) confirm (mostly-evolutionary). Presumption
(evolutionary-revolutionary-spectrum) is confirmed by visual examination
of the spectrum of reliability values in each of the 28 columns present
in Figure 5.1: there is no obvious dichotomic discretization of these spectra.

Figure 5.1: Reliability of marker metrics (top) and their negation (bottom)
in the corpus. Larger circles denote more cardinal version changes; triangles
denote the mean minus standard deviation barrier.

One attempt at setting a border between the evolutionary and revolutionary
ends is depicted in the figure: the triangles in each column denote the
mean minus standard deviation level. This discretization does not seem
to be superior to others, and the partitioning between evolutionary and
revolutionary ends that it offers seems arbitrary; e.g., in the wrapper
column, there does not seem to be a good reason to place the border at 70%
rather than at 83%.

The size of circles in Figure 5.1 denotes change cardinality. We expected
larger circles to fall in the revolutionary end, but this is not easy to
confirm visually. Instead, for (major-changes-large), we computed the
$\tau_b$ correlation of the reliability values and the pair's change
cardinality for each of the marker metrics (and their negations). Similarly
to the testing in Section 3.2, this computation was carried out for the
entire ensemble, but also separately for the set of versions of each of the
artifacts in the ensemble.

Rather surprisingly, the results revealed a positive correlation of reliability 
values and cardinality. In other words, we found that marker metrics tend to 
change less in major version increments and more in minor version changes. 

The specifics follow: half of the $\tau_b$ values for the entire ensemble
were statistically significant (p-value less than 0.05); in all of these
$\tau_b$ was positive, ranging between 0.13 and 0.33. Further, even all
non-significant values (for the entire ensemble) were positive (with the sole
exception of the interface marker metric, in which $\tau_b = -0.03$). The
same happens for $\tau_b$ values computed within each artifact. All
statistically significant values are positive, ranging from $\tau_b = 0.49$
to $\tau_b = 1$. (Again, artifacts with a small number of versions did not
usually have statistically significant results.)

5.2 Style Preservation 

Having studied changes to existing types in our corpus, we now turn to
checking whether newly added classes tend to follow the "style" of the existing
software body, and whether this tendency is correlated with the cardinality
of the change. The difficulty in testing (preservation-of-style) is that the
arsenal of standard statistical tests is good at showing that a set of values
does not follow a given (null-hypothesis) distribution, but usually falls short
of showing the inverse, namely that the set of values indeed obeys a given
distribution.

We applied the standard $\chi^2$ test for each of the marker metrics and each
of the pairs in the corpus (a total of 76 × 14 tests) to determine whether
there is any statistical difference in the prevalence of the metric in the
earlier software version and its prevalence among newly added types.
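
One plausible form of such a test, given here only as an assumption since
the exact variant is not spelled out, is Pearson's $\chi^2$ test of
homogeneity on a 2 × 2 table whose rows are the earlier version and the newly
added types, and whose columns count marked and unmarked types. The Python
sketch below uses no continuity correction and compares the statistic with
3.841, the critical value for one degree of freedom at the 0.05 level; the
function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:                 # an empty row or column: no evidence either way
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def prevalence_differs(marked_old, total_old, marked_new, total_new,
                       critical=3.841):
    """True if prevalence among new types differs significantly (alpha = 0.05)."""
    stat = chi_square_2x2(marked_old, total_old - marked_old,
                          marked_new, total_new - marked_new)
    return stat > critical
```

For instance, 10 marked types out of 100 in the earlier version against 30
marked types out of 100 new ones gives a statistic of 12.5, which is
significant, whereas identical proportions give a statistic of 0.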

Statistically significant values of the $\chi^2$ statistic were found in only
25.5% of the tests, and, in confirmation of (preservation-of-style), these
significant changes to the prevalence are correlated with the cardinality of
the change. This correlation is not so strong, r = 0.09, but it is
statistically significant (p-value < 0.01).

Does the addition of new classes change the overall prevalence of any of
the metrics? Our findings indicate that this rarely happens. Table 5.2 depicts
the main statistics of changes to the prevalence in the corpus.

We see that in most metrics, the average change in the prevalence is
about 0.02% (and is always less than 1.6%). Similarly, in most metrics the
median change is about 0.01% (and is always less than 0.7%). However, it
would be incorrect to say that the prevalence never changes drastically in
consecutive versions of software. Examining the "min" and "max" columns
in the table, we see that the difference in prevalence can be almost fifty
points in both directions. Even in the function pointer micro pattern, in
which the extreme values are small, they are not negligible when compared
to the typical occurrences of this micro pattern. As predicted by
(mostly-evolutionary), there are few cases of revolutionary changes, in which
the prevalence changes by as much as 50%.

6 Numerical Metrics 

Reliability of marker metrics was defined simply as the portion of classes that 
retained the marker during the change. 

A straightforward extension of this definition to numerical metrics leads
to bogus results, since even an addition of a single type may change the metric
value of all types. The reason is that the precise values of numerical metrics,
even local ones, are highly sensitive to change. For example, introducing a
new root to the inheritance hierarchy will change the DIT metric of all types.
Our empirical findings showed that progressing to the next software version
changed the PageRank of 99% or more of the types, and the WMC value of
36% of the types.

We therefore use a more robust definition of reliability, which relies on
Kendall's $\tau_b$ coefficient of correlation (see Section 3.2 above) to
compare the relative ordering of values that a metric yields. The computation
of $\tau_b$ is (naturally) done only for the types which are present in both
versions. However, the value of $\mu$ may depend on types which occur in only
one of the two versions.

High values of $\tau_b$ imply high reliability. For example, if
$\tau_b = 0.9$ for a certain numerical metric $\mu$ and a certain pair of
versions (G, G*), then the implication

\[ \mu_G(u) > \mu_G(v) \Rightarrow \mu_{G^*}(u) > \mu_{G^*}(v) \]

holds for 95% of the pairs of types $u, v \in G \cap G^*$.
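
The 95% figure follows from the definition of the coefficient when there are
no ties: if C and D denote the numbers of concordant and discordant pairs,
then $\tau_b = (C - D)/(C + D)$, so the fraction of concordant pairs is
$(1 + \tau_b)/2$, which equals 0.95 for $\tau_b = 0.9$. The Python sketch
below computes $\tau_b$ by direct enumeration of pairs, restricted to types
present in both versions. It is a quadratic-time illustration with
hypothetical function names, not the procedure used for the corpus.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b of two equally long sequences, by pair enumeration."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denom if denom else float("nan")

def numerical_reliability(mu_g, mu_gstar):
    """tau-b of a metric over the types present in both versions.

    mu_g, mu_gstar: dicts mapping type name -> metric value in G and in G*.
    """
    common = sorted(set(mu_g) & set(mu_gstar))
    return kendall_tau_b([mu_g[t] for t in common],
                         [mu_gstar[t] for t in common])
```

Because only the relative order of values matters, a uniform shift of all
values (such as the one caused by a new inheritance root in the DIT example)
leaves the reliability at 1.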

We thus computed the values of $\tau_b$ for all metrics. This computation was
restricted to consecutive pairs only, for two reasons. First, the reliability
value in moving from version i to version i + 2 can be broken down, at least
mentally, into two factors: that of the progression from version i to version
i + 1 and that of the progression from version i + 1 to version i + 2. Second,
the consideration of all pairs biases our sample towards artifacts with more
versions.

As it turns out, the number of types involved in computing $\tau_b$ was
so large (recall Table 2.1) that all values thus computed were statistically
significant with a high confidence level (p-value < 0.001).

Table 6.1 presents the essential statistics of these values for local
numerical metrics.

Most obviously, all values are high. In fact, we have that the average
value of $\tau_b$, computed across all metrics and all pairs, is 0.93, in
confirmation of (metrics-reliable). We also see that the median is greater
than the mean, confirming (evolutionary-revolutionary-spectrum) and
(mostly-evolutionary).

Similar findings are exhibited by Table 6.2, which repeats Table 6.1 for
global metrics. Interestingly, the mean and median values of all metrics are
quite similar. Still, in comparing the Mean, Median, and Min columns in the
two tables we see that global metrics are (slightly) less reliable than local
metrics.

Figure 6.1 is the equivalent of Figure 5.1 for numerical metrics. Examining
the figure, we see again the spectrum of software changes
(evolutionary-revolutionary-spectrum), with many more changes at the
evolutionary end (mostly-evolutionary), at which reliability is high
(metrics-reliable). Again, a visual inspection is not sufficient to confirm
(revolutionary-changes-in-major-version), i.e., that the more radical changes
of the ranking offered by the metric tend to occur with more cardinal version
number increments.

Thus, as before, to confirm (revolutionary-changes-in-major-version),
we computed for each of the metrics the coefficient of correlation between the
ranking defined by the change cardinality number and the metric's reliability
value. This was done for the entire ensemble. As expected (and in contrast
to what we found with marker metrics), the correlation was negative: higher
reliability of numerical metrics tends to coincide with more minor increments
of the version number. Specifically, all values for the ensemble were
significant and negative, and all values for a specific artifact which were
significant were also negative. (Values of $\tau_b$ ranged between -0.30 and
-0.84, but were typically about -0.70.)

Figure 6.1: Reliability of numerical metrics in the corpus. Larger circles
denote more cardinal version changes; triangles denote the mean minus standard
deviation barrier.

7 Understanding Reliability of Global Numerical Metrics

How can the high reliability values of global numerical metrics be explained?
The answer that we would like to have is: "these metrics indeed capture the
essence of a software architecture; their preservation indicates that
architecture is persistent".

Unfortunately, as will become clear at the end of this section, this
answer is only partially correct. Much of the high agreement of the ranking
is explained by the limited scope of changes to software between versions.
Even random mutations of the software graph reach the same high values.
In other words, most of the reliability of numerical metrics does not capture
"architecture" in any deep way, and is in essence a reflection of the fact that
there are many types and edges that continue to the next edition of a software
artifact. Interestingly, we will see that there exists a residual reliability
which is well explained by the hypothesis that the relative rankings offered
by the global metrics suite capture hidden architectural traits.

Simple-minded, random mutation. Let G be a graph of a certain
version of a software artifact and let G* be the graph of the succeeding
version. Then, instead of comparing G with G* as we did before, we shall now
compare G with a mutation graph M = M(G), generated by random mutation
of G, which has the same number of types and edges as G*.

Table 7.1 shows the advantage of the reliability values found above over
random, yet simple-minded, graph mutations.

In the experiments, we computed, for each of the metrics, the reliability
of the metric in the pair (G, G*) and subtracted from it the reliability of
this metric in the pair (G, $M_0$), where graph $M_0$ was generated by adding
to G the random nodes and edges necessary to make it as large as (occasionally
as small as) G*. The results were then summarized in the second table column.
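
A mutation of this simple-minded kind can be sketched in Python as follows.
The sketch covers only the common case in which G* is at least as large as G,
and it makes choices that the text leaves open: new nodes receive
hypothetical names, edges are drawn uniformly among ordered pairs of distinct
nodes, and a seeded generator is used for repeatability. The function name is
hypothetical as well.

```python
import random

def simple_mutation(nodes, edges, n_nodes, n_edges, seed=0):
    """Grow (nodes, edges) at random to n_nodes nodes and n_edges edges.

    Existing nodes and edges are kept; only the growing case is handled.
    """
    rng = random.Random(seed)
    nodes, edges = list(nodes), set(edges)
    while len(nodes) < n_nodes:
        nodes.append(("new", len(nodes)))      # hypothetical fresh node name
    # Never ask for more edges than a simple directed graph can hold.
    target = min(n_edges, len(nodes) * (len(nodes) - 1))
    while len(edges) < target:
        u, v = rng.sample(nodes, 2)            # two distinct endpoints
        edges.add((u, v))
    return nodes, edges
```

The reliability of a metric in the pair (G, $M_0$) is then computed exactly
as for a real pair of versions, and the difference between the two values is
what the table reports.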

The third table column was computed in a similar fashion, except that it
uses a variant method of computing the mutation M: instead of growing G,
we randomly shrink G* to obtain M, and use the pair (M, G) instead
of (G, M).

Comparing both the second and third columns of this table with the high
values found in Table 6.2 shows that these high numbers are not so telling.
Half to two thirds of the apparent agreement between metric values of two
versions of the software is found in totally random mutations.

We observe still that the agreement between metric values in a real
successor is always better than that of a random mutation. Having this happen
consistently in 17 metrics cannot be a mere coincidence. We have that, with
statistical significance of $\alpha < 0.001$ or better, the agreement of
metrics ranking is not a matter of pure chance.
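
One way to make this argument concrete is a sign test. This is our reading
and not a procedure stated in the text: assume that, under the null
hypothesis, the real successor and the random mutation are equally likely to
show the better agreement for any given metric, independently across the 17
metrics. Then:

```latex
\[
  \Pr[\text{real successor is better in all 17 metrics}]
  = \left(\tfrac{1}{2}\right)^{17}
  = \frac{1}{131072}
  \approx 7.6 \times 10^{-6} < 0.001 .
\]
```

The independence assumption is optimistic, since the metrics are computed on
the same graphs and are likely to be correlated with one another.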

Topological mutations. The randomness in the above mutations
allowed situations which are unlikely to occur in the life-cycle of software.
For example, in selecting edges in a completely random fashion, the number of
edges between the existing nodes and the new nodes would be much smaller
than in the real new version graph G*. We ask now whether there exists a
more structured mutation that can yield the same reliability values as found
in actual software evolution.

The five mutations presented next try to imitate the topology of the
changes to a software graph. All of these mutations start with the original
graph G and apply two preliminary transformations to this graph: First, all
edges and nodes present in G but not in G* are removed. Second, all nodes
present in G* but not in G are added.

These deterministic transformations create a graph which (i) has all the
core nodes, (ii) has all the preserved core/core edges, and (iii) has the same
number of nodes as G*. The duty of a subsequent random mutation is to
add new edges to this transformed graph so that it has the same number of
edges as G*.

Recall now our categorization of edge kinds and their fate (Section 3.4).
Edges in graph G* are of the following five kinds: core/core (preserved),
core/core (added), core/new, new/core and new/new. Our construction of
the transformed graph is such that the first kind exists in it. All mutations
in our experiments add the correct number of missing edges of each of the
four remaining kinds.
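
The five kinds can be told apart mechanically once the core is known. The
Python sketch below assumes, as the construction above suggests, that the
core consists of the nodes present in both G and G*, and that a core/core
edge is preserved exactly when it already exists in G; the function name and
the dictionary keys are hypothetical.

```python
def classify_edges(g_nodes, g_edges, gstar_nodes, gstar_edges):
    """Partition the edges of G* into the five kinds named in the text."""
    core = set(g_nodes) & set(gstar_nodes)   # assumed meaning of "core"
    old_edges = set(g_edges)
    kinds = {
        "core/core (preserved)": set(),
        "core/core (added)": set(),
        "core/new": set(),
        "new/core": set(),
        "new/new": set(),
    }
    for u, v in gstar_edges:
        if u in core and v in core:
            preserved = (u, v) in old_edges
            key = "core/core (preserved)" if preserved else "core/core (added)"
        elif u in core:
            key = "core/new"
        elif v in core:
            key = "new/core"
        else:
            key = "new/new"
        kinds[key].add((u, v))
    return kinds
```

The sizes of the last four sets give the number of edges of each kind that a
mutation has to add to the transformed graph.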

The difference between the mutations is in the way the source and target 
nodes are selected for edges in each of these categories. We use three different 
policies. 

1. Same means that the source and the target are not selected at random; 
we simply copy the edge from G* to the mutated graph; 

2. Random means that the source and the target are selected at random 
from the sets of all nodes in the corresponding group. 

For example, if we apply this policy of selection to generate random
core/new edges, then every such edge connects a random node in the
core with a random node in the new set.

3. Random/boundary is similar to the Random policy, except that the 
source and the target are selected at random from the more restricted 
set of nodes which served in G* as source or target for the corresponding 
kind. 

For example, in applying this policy of selection to generate random
core/new edges, every newly created edge connects (i) a node selected
at random from the set of core nodes with an edge in G* leading to a
new node with (ii) a node selected at random from the set of new
nodes with an incoming edge in G* starting at a core node.
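
The three policies can be sketched for the core/new edge kind as follows.
This Python fragment is an illustration, not the experimental code: the
function name and the policy strings are hypothetical, duplicate edges are
rejected so that the number of generated edges equals the number of real
core/new edges, and a seeded generator is used for repeatability. Node names
are assumed to be mutually comparable (e.g., strings).

```python
import random

def core_new_edges(core, new, gstar_edges, policy, seed=0):
    """Generate core/new edges under one of the three selection policies."""
    rng = random.Random(seed)
    real = {(u, v) for u, v in gstar_edges if u in core and v in new}
    if policy == "same":
        return real                               # copy the edges of G*
    if policy == "random":
        sources, targets = set(core), set(new)    # any core node, any new node
    elif policy == "random/boundary":
        sources = {u for u, _ in real}            # core nodes on the boundary
        targets = {v for _, v in real}            # new nodes on the boundary
    else:
        raise ValueError(f"unknown policy: {policy}")
    sources, targets = sorted(sources), sorted(targets)
    chosen = set()
    # The real edges are a subset of sources x targets, so the loop terminates.
    while len(chosen) < len(real):
        chosen.add((rng.choice(sources), rng.choice(targets)))
    return chosen
```

The other edge kinds are handled in the same way, with the source and target
groups replaced accordingly.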

Table 7.2 uses these policies to describe the mutations we apply. Columns
of the table describe the locus of mutations: the core locus refers to added
core/core edges; the cut locus refers to the core/new and new/core edges;
and the new locus refers to new/new added edges.

Mutation $M_1$ is the simplest; it is much like $M_0$ described above, except
that it maintains the balance of edge groups. Mutation $M_2$ imitates a real
software version slightly better, in that it uses the Random/boundary policy
for edge selection.

Mutations $M_3$, $M_4$ and $M_5$ were designed with the objective of better
understanding which graph locus contributes more to the agreement of
metric values:

• Mutation $M_5$ is almost identical to G*. The difference is only in a
random selection of edges in the cut graph locus (and even these
edges are not entirely random; they only connect nodes which were
adjacent on the cut in graph G*).

• Mutations $M_3$ and $M_4$ are similar to $M_5$ except that in $M_3$ there are
random changes to the new locus and in $M_4$ there are random changes
to the core locus.

Results. Figure 7.1 summarizes the median of the reliability values
calculated for the pairs (G, G*) and (G, $M_i$) for $i = 0, 1, \ldots, 5$.




Figure 7.1: Median value of reliability of global metrics across consecutive
versions of software artefacts and random mutations $M_0, \ldots, M_5$.

The topmost line in the figure, labeled "real", designates the high
agreement values found in the (G, G*) pairs. The other lines correspond to the
mutations. We see that the random and not so structured mutations $M_0$
and $M_1$ explain about 2/3 of the high values of agreement. Furthermore,
even though $M_1$ is more structured, it is inferior to $M_0$ in at least
several metrics. A substantial improvement occurs when we move to $M_2$, in
which we select an edge to connect two random "portal" nodes.

The reliability values of (G, $M_5$) are almost exactly the same as those of
real software versions, i.e., of (G, G*). This is not very surprising, since
the graph $M_5$ is different from G* only in one locus.

Now both $M_3$ and $M_4$ are different from G* in two loci. We expect $M_3$
to agree with G better than $M_4$ agrees with it. The reason is that the
agreement between metric rankings is compared only at core nodes. As
indicated by Table 7.2, the edges of $M_3$ at the core are exactly the same
as in G*. We expected mutation $M_3$ to yield good reliability values: after
all, in this mutation randomness was limited to the new locus, which is the
farthest from the core.

What is surprising, though, is that mutation $M_4$ (in which the core/core
added edges are random) approximates a real software version almost at the
same level of agreement as $M_3$. Put differently, the new and the core loci
have the same impact on reliability.

Analysis. We have thus observed that the particular way in which real
evolution of software "selects" edges in the new locus has a substantial
and measurable impact on the reliability of global numerical metrics. This
observation is consistent with (preservation-of-style).

More importantly, the dual of this observation tells us that our suite of
global numerical metrics is sensitive to additions to the new locus.
Reliability is preserved only if these additions are done in a manner
"consistent" with the evolution of real software.

Perhaps the simplest explanation of this observation and its dual is in 
the claim that this suite reflects an underlying software architecture. If the 
introduction of edges to the new locus is consistent with the underlying 
architecture, rankings of nodes as defined by the suite will not change much. 
In contrast, random additions to the new locus, which are inconsistent with 
the architecture, will perturb these rankings. 

8 Related Work 

A measure of the relative importance of components within the software
structure was examined in [?]. The authors suggested using CodeRank,
the software equivalent of Google's well-known PageRank [?] method for
ranking web pages, as a metric indicating how important a specific component
is based on its coupling to the rest of the system. In an earlier work [?],
the authors suggested using a similar metric, called Component Rank, for the
same purpose. The main difference between these metrics is that
CodeRank is computed based on the weighted graph that represents various
usage relations between the components and the number of times each usage
occurs.

Lorenz et al. [?] recommend using a wide range of metrics to test the
quality of models, classes and methods. Various metrics related to coupling,
inheritance and size of classes and methods play the major role in deducing
the quality of the software. However, the reliability of those metrics is not
analyzed by the authors.

Lajios et al. [?] investigated the correlation of various software metrics to
the defects found in software modules and proposed an approach to determine
a set of metrics for quality assessment of complex software systems. First
they calculated various quantitative, complexity, coupling and other metrics
at the class level for several similar projects using different open source tools.
Then they found the correlation of these metrics to the history of bugs using
machine learning techniques. They found that although some of the metrics
are more suitable for the assessment of software quality, these metrics differ
between the analyzed projects even though their natures are similar. They
also discovered that 5 out of 11 metrics were irrelevant for the analyzed
systems.

Ordonez et al. [?] examined various metrics used in the software industry
to measure code size and design complexity. They mentioned that NASA
used the first five metrics presented in [?] in the tool they developed for
analyzing source code with respect to its architecture. The authors' analysis
was focused on how reliable specific software modules are with respect to
their maintainability and the probability of causing defects. They, however,
did not explore whether the metrics themselves were reliable.

Kitchenham et al. [?] compared the ways axiom sets are derived
and used in mathematics with those used in software metrics research. The
authors claim that the use of axioms for measurement of size and complexity
concepts is not mature enough and that there is a non-negligible risk that
using axioms to validate software metrics may reject a valid measure or accept
an invalid one.

The issue of validity was more frequently discussed in the community
than reliability. For many years, researchers argued (see e.g., [?]) that it is
very difficult to come up with a solid proof that any external metrics and
measurements of software, such as Halstead's software science metrics [?],
cyclomatic complexity [?], or even size, pertain to more interesting,
internal properties, such as maintainability, stability, etc. The community
therefore resorts to convincing argumentation, often backed by mathematical
arguments [?], case study analysis [?, ?], etc. [?].

Mahmoud et al. [?] investigated the logical stability of object-oriented 
designs. They computed the correlation between Chidamber and Kemerer's 
metrics [?] and the likelihood of classes to stay change-prone as a consequence 
of changes made to other classes in the design. They analyzed a list of design 
and class-level changes, and investigated how changes in one class affect 
others. They showed that five out of six metrics were negatively correlated 
with the logical stability of the classes. The authors performed their analysis 
on a relatively small set of subject systems and on a single version of each 
system. They focused on local metrics without analyzing the actual changes 
made on systems as they are being developed. 

Fenton et al. [?] investigated metric-based software defect prediction
models and suggested that various size and complexity metrics cannot serve
as good predictors of software defects. They criticized the approaches that
used some of the metrics covered in this work with respect to defect
prediction. However, they did not question the reliability of the metrics
with respect to the design of the software. Furthermore, the changes of metric
values and the number of defects over time were excluded from their prediction
model.

Emam et al. [?] argued that the validity of object-oriented metrics is 
questionable as most of them are indifferent to the size of the software. They 
examined Chidamber and Kemerer [?] metrics as well as Lorenz and Kidd [?] 
metrics and showed that their correlation with fault-proneness is similar to 
their correlation with the number of source code lines. Therefore they claimed 
that any analysis of metrics for software artifacts should be "normalized" 
with respect to their size. Evanco [?] criticizes the statistical analysis pro- 
posed by Emam and claims that the model suggested by the authors fails to 
provide useful information as to the effect of the size of the code. In our work 
we covered software artifacts of various sizes and found that some metrics 
were reliable even when the size of the same software artifact increased five 
times between two consecutive versions of the same artifact. 

9 Conclusions 

We presented a metrics suite comprising 36 code metrics drawn from var- 
ious independent sources. Our taxonomy of metrics included a distinction 
between semantical and topological metrics, a breakdown by directionality, 
and range of values yielded by the metric. Our study did not reveal sub- 
stantial differences between semantical and topological metrics. Also, the 
distinction between directional and uni-directional metrics did not translate 
into different properties of the metric. We did not identify any meaningful 
distinction between discrete and continuous metrics, even though some of 
the metrics with discrete values assumed only 4 or 5 different values. 

Curiously, even though the metrics originated from different sources and 
were described by different authors, all metrics in the same group exhibited 
essentially the same behavior: for example, the reliability of all marker 
metrics was about 99%, and the reliability of all (but one) local metrics 
was about 93%. 

Most of the presumptions presented in Section 1 were confirmed. The 
exceptions were (locality-of-change), for which we found that 5 out of 6 
types included at least one changed type in their neighborhood, and 
(revolutionary-changes-in-major-versions), whose opposite was confirmed 
with respect to local metrics. The presumption (metrics-reliability) was 
confirmed for some, but not all, of the analyzed metrics. Marker metrics 
were reliable. For numerical metrics, our experiments showed that 
reliability was negatively correlated with scope: internal metrics, i.e., 
metrics which depend only on a single class, were extremely reliable; local 
metrics, which depend on a class and its neighbors, were slightly less 
reliable; global or topological metrics were unreliable. 

Presumption (preservation-of-style) was confirmed under an implicit 
interpretation of the term "style" as the prevalence of marker metrics. In a 
sense, Section 7 explored this presumption from the point of view of 
global numeric metrics. It was shown there that the interconnections 
among newly added types have a strong impact on the ordering of global 
numeric metrics computed at the core types. 

Further research should probably focus on the link between numerical 
metrics and topological architecture as implied by the phenomena shown in 
Section 

Table 4.1: Metrics used in experiments and their categories. 

Metric              Mean (%)       Median (%)     Min (%)   Max (%) 

final               15.0 ± 15.1     7.3 ± 6.9       0.0       48.5 
abstract             4.6 ± 2.4      4.1 ± 1.8       0.8       11.8 
interface           10.1 ± 5.2      8.1 ± 4.2       1.7       21.2 
sink                 1.3 ± 2.2      0.6 ± 0.6       0.0       16.7 
source              27.7 ± 16.4    29.5 ± 13.7      1.0       55.3 
balloon             10.8 ± 8.9      7.2 ± 4.6       1.1       42.1 
wrapper             25.7 ± 7.1     23.9 ± 3.0      12.0       49.5 
pure                 8.9 ± 5.0      8.1 ± 3.7       0.8       21.2 
pool                 1.4 ± 1.2      1.0 ± 0.6       0.0        5.9 
designator           0.4 ± 0.6      0.2 ± 0.2       0.0        4.3 
function pointer     0.2 ± 0.4      0.0 ± 0.0       0.0        2.2 
stateless           28.7 ± 8.8     29.3 ± 4.9       9.8       53.0 
sampler              0.9 ± 0.7      0.8 ± 0.3       0.0        3.2 
canopy              17.1 ± 9.1     15.9 ± 7.8       3.4       47.6 

Table 5.1: Essential statistics of the prevalence of the marker metrics in the 
suite. 

Metric              Mean (%)        Median (%)      Min (%)   Max (%) 

final                0.01 ± 10.39    0.00 ± 0.93    -46.95     48.54 
abstract             0.02 ± 1.36    -0.03 ± 0.19     -6.48      5.60 
interface           -0.01 ± 2.23    -0.05 ± 0.50    -10.63      7.70 
sink                -0.21 ± 1.84     0.00 ± 0.13     -9.71      5.44 
source               1.57 ± 7.13     0.66 ± 1.04    -29.81     35.52 
balloon             -0.99 ± 3.20    -0.19 ± 0.67    -20.09      6.80 
wrapper             -0.17 ± 4.33     0.01 ± 0.97    -18.66     24.49 
pure                 0.01 ± 2.19    -0.04 ± 0.47    -10.63      8.12 
pool                 0.21 ± 1.00    -0.01 ± 0.12     -1.92      5.85 
designator           0.03 ± 0.57     0.00 ± 0.01     -1.92      4.26 
function pointer     0.01 ± 0.09     0.00 ± 0.00     -0.33      0.37 
stateless            0.69 ± 3.75     0.10 ± 1.10     -8.13     15.58 
sampler             -0.00 ± 0.51    -0.02 ± 0.08     -2.19      1.60 
canopy               0.09 ± 5.09    -0.10 ± 1.30    -13.93     26.65 

Table 5.2: Changes in prevalence of marker metrics in consecutive versions 
of software artifacts. 
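
The quantities summarized in Tables 5.1 and 5.2 can be illustrated with a minimal sketch. It assumes that the prevalence of a marker metric is the percentage of types in a version for which the Boolean metric holds, and that the ± term next to the mean is a standard deviation; neither assumption is confirmed by this section. The function names and the toy data are hypothetical.

```python
from statistics import mean, median, pstdev


def prevalence(flags):
    """Percentage of types for which a Boolean marker metric holds."""
    return 100.0 * sum(flags) / len(flags)


def prevalence_changes(versions):
    """Change in prevalence (percentage points) between consecutive versions."""
    p = [prevalence(flags) for flags in versions]
    return [later - earlier for earlier, later in zip(p, p[1:])]


def summarize(values):
    """Mean, population standard deviation, median, min and max of a sample."""
    return {
        "mean": mean(values),
        "std": pstdev(values),  # assumption: the table's +/- is a std. deviation
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Toy data (hypothetical): one marker metric over three versions of an artifact.
versions = [
    [True, False, False, False],         # prevalence 25%
    [True, True, False, False],          # prevalence 50%
    [True, False, False, False, False],  # prevalence 20%
]
changes = prevalence_changes(versions)   # [25.0, -30.0]
stats = summarize(changes)
```

Pooling such changes over all consecutive version pairs of all artifacts would give one row of a table in the style of Table 5.2.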

Metric        Mean           Median         Min     Max 

DIT           0.95 ± 0.08    0.98 ± 0.02    0.55    1.00 
NOA           0.96 ± 0.06    0.98 ± 0.02    0.70    1.00 
NOC           0.94 ± 0.09    0.97 ± 0.03    0.34    1.00 
CBO           0.93 ± 0.07    0.95 ± 0.04    0.59    1.00 
RFC           0.93 ± 0.07    0.94 ± 0.04    0.58    1.00 
WMC           0.93 ± 0.07    0.94 ± 0.04    0.56    1.00 
#Incoming     0.94 ± 0.08    0.96 ± 0.04    0.52    1.00 
#Outgoing     0.93 ± 0.06    0.94 ± 0.04    0.67    1.00 
Belonging     0.87 ± 0.13    0.88 ± 0.07    0.22    1.00 

Table 6.1: Essential statistics of r_b of local metrics across consecutive versions 
of software artifacts. 



Metric               Mean           Median         Min     Max 

#Clients             0.93 ± 0.09    0.96 ± 0.04    0.55    1.00 
#Descendants         0.90 ± 0.10    0.93 ± 0.07    0.63    1.00 
#SCCIncoming         0.90 ± 0.12    0.93 ± 0.06    0.48    1.00 
#SCCClients          0.92 ± 0.10    0.94 ± 0.05    0.52    1.00 
#SCCOutgoing         0.90 ± 0.12    0.93 ± 0.06    0.48    1.00 
#SCCDescendants      0.90 ± 0.10    0.93 ± 0.07    0.64    1.00 
SCCSize              0.89 ± 0.12    0.93 ± 0.07    0.59    1.00 
#DominatedBy         0.87 ± 0.17    0.89 ± 0.09    0.05    1.00 
#DominatedBy'        0.83 ± 0.15    0.93 ± 0.06    0.21    1.00 
#DominatorHeight     0.87 ± 0.13    0.90 ± 0.08    0.33    1.00 
#DominatorHeight'    0.83 ± 0.15    0.91 ± 0.07    0.10    1.00 
#DominatorWeight     0.87 ± 0.13    0.89 ± 0.08    0.35    1.00 
#DominatorWeight'    0.83 ± 0.15    0.91 ± 0.07    0.08    1.00 
PageRank             0.93 ± 0.08    0.94 ± 0.05    0.50    1.00 
PageRank'            0.83 ± 0.10    0.89 ± 0.07    0.56    1.00 
Betweeness           0.89 ± 0.11    0.91 ± 0.07    0.41    1.00 
Betweeness'          0.83 ± 0.12    0.91 ± 0.08    0.35    1.00 

Table 6.2: Essential statistics of r_b of global metrics across consecutive 
versions of software artifacts. 
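
To give a feel for the kind of statistic reported in Tables 6.1 and 6.2, the sketch below computes a rank correlation between the values a numeric metric assigns to the types common to two consecutive versions. The exact reliability statistic is defined earlier in the paper and is not reproduced here; Spearman's rank correlation is used purely as a stand-in assumption, and all names and data are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined when either sample is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / sqrt(vx * vy)


def rank_agreement(old, new):
    """Spearman correlation of a metric over types present in both versions.

    `old` and `new` map type names to metric values in two versions.
    This is an assumed stand-in, not the paper's own statistic.
    """
    common = sorted(old.keys() & new.keys())
    if len(common) < 2:
        raise ValueError("need at least two common types")
    xs = average_ranks([old[t] for t in common])
    ys = average_ranks([new[t] for t in common])
    return pearson(xs, ys)


# Toy data (hypothetical): type D was removed and type E added in the new version.
old = {"A": 3, "B": 1, "C": 2, "D": 5}
new = {"A": 4, "B": 1, "C": 2, "E": 7}
agreement = rank_agreement(old, new)  # ordering of A, B, C is preserved
```

Restricting the comparison to common types is a design choice of this sketch: types that exist in only one of the two versions have no counterpart value to compare against.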

Metric               Median (Grow)    Median (Shrink) 

#Clients             0.35 ± 0.24      0.34 ± 0.21 
#Descendants         0.46 ± 0.22      0.46 ± 0.20 
#SCCIncoming         0.34 ± 0.26      0.35 ± 0.23 
#SCCClients          0.32 ± 0.22      0.37 ± 0.21 
#SCCOutgoing         0.34 ± 0.26      0.35 ± 0.23 
#SCCDescendants      0.41 ± 0.23      0.40 ± 0.22 
SCCSize              0.44 ± 0.26      0.44 ± 0.21 
#DominatedBy         0.50 ± 0.27      0.43 ± 0.25 
#DominatedBy'        0.35 ± 0.27      0.38 ± 0.24 
#DominatorHeight     0.45 ± 0.24      0.36 ± 0.23 
#DominatorHeight'    0.33 ± 0.23      0.28 ± 0.20 
#DominatorWeight     0.44 ± 0.24      0.35 ± 0.23 
#DominatorWeight'    0.31 ± 0.24      0.27 ± 0.19 
PageRank             0.23 ± 0.15      0.21 ± 0.14 
PageRank'            0.30 ± 0.15      0.28 ± 0.15 
Betweeness           0.39 ± 0.14      0.35 ± 0.15 
Betweeness'          0.38 ± 0.14      0.34 ± 0.16 

Table 7.1: Difference between the reliability of global metrics across consecutive 
versions of software artifacts and the reliability computed under random graph 
growth and shrinkage. 





Mutation    Core               Cut                New 

M1          Random             Random             Random 
M2          Random/boundary    Random/boundary    Random/boundary 
M3          Same               Random/boundary    Random/boundary 
M4          Random/boundary    Random/boundary    Same 
M5          Same               Random/boundary    Same 

Table 7.2: Mutations imitating a subsequent software version 